This is a mock report exploring data from the UK Department for Transport’s STATS19, accessed through the stats19 R package (Lovelace, R., Morgan, M., Hama, L., Padgham, M., Ranzolin, D., & Sparks, A. (2019). stats19: A package for working with open road crash data. The Journal of Open Source Software, 4(33), 1181. https://doi.org/10.21105/joss.01181). This report serves several purposes:
[…] accidents, casualties and vehicles, although this report only analyses (for now) the first two due to time constraints.
[…] plotly interactive charts instead of using my beloved ggplot, and use the Mapillary API to get images of the roads where accidents took place.

Disclaimer: This report was made in 12 hours by someone who had never worked with this kind of dataset before. As a result, it should be considered a draft, and it might contain writing mistakes and hasty conclusions.
Let’s see how many observations we have, as well as the number and types of the variables.
Data summary
| Property | Value |
|---|---|
| Name | accidents2018 |
| Number of rows | 122635 |
| Number of columns | 31 |
| Column type frequency: Date | 1 |
| Column type frequency: factor | 22 |
| Column type frequency: numeric | 8 |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| datetime | 13 | 1 | 2018-01-01 | 2018-12-31 | 2018-07-05 | 365 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| accident_index | 0 | 1.00 | FALSE | 122635 | 201: 1, 201: 1, 201: 1, 201: 1 |
| police_force | 0 | 1.00 | FALSE | 51 | Met: 25390, Wes: 5490, Ken: 4403, Wes: 4132 |
| accident_severity | 0 | 1.00 | FALSE | 3 | Sli: 97799, Ser: 23165, Fat: 1671 |
| date | 0 | 1.00 | FALSE | 365 | 201: 504, 201: 498, 201: 491, 201: 488 |
| day_of_week | 0 | 1.00 | FALSE | 7 | Fri: 20021, Thu: 18656, Wed: 18397, Tue: 17950 |
| time | 13 | 1.00 | FALSE | 1438 | 17:: 1154, 18:: 1093, 17:: 1086, 16:: 1069 |
| local_authority_district | 0 | 1.00 | FALSE | 380 | Bir: 2614, Lee: 1548, Wes: 1509, Lam: 1287 |
| local_authority_highway | 0 | 1.00 | FALSE | 207 | Ken: 3811, Sur: 3113, Lan: 2676, Ham: 2615 |
| first_road_class | 0 | 1.00 | FALSE | 6 | A: 53499, Unc: 43355, B: 14210, C: 7005 |
| road_type | 0 | 1.00 | FALSE | 6 | Sin: 88323, Dua: 19473, Rou: 7573, One: 3366 |
| junction_detail | 0 | 1.00 | FALSE | 10 | Not: 52076, T o: 35958, Cro: 11422, Rou: 9974 |
| junction_control | 0 | 1.00 | FALSE | 5 | Dat: 54842, Giv: 53259, Aut: 13323, Sto: 750 |
| second_road_class | 52211 | 0.57 | FALSE | 6 | Unc: 48631, A: 12213, B: 4662, C: 4168 |
| pedestrian_crossing_human_control | 0 | 1.00 | FALSE | 4 | Non: 117924, Dat: 3173, Con: 1116, Con: 422 |
| pedestrian_crossing_physical_facilities | 0 | 1.00 | FALSE | 7 | No : 94877, Ped: 9753, Pel: 7169, Zeb: 4583 |
| light_conditions | 0 | 1.00 | FALSE | 5 | Day: 88435, Dar: 24746, Dar: 6120, Dar: 2477 |
| weather_conditions | 0 | 1.00 | FALSE | 10 | Fin: 99221, Rai: 12789, Unk: 3666, Oth: 2603 |
| road_surface_conditions | 0 | 1.00 | FALSE | 6 | Dry: 90546, Wet: 28215, Fro: 1417, Dat: 1223 |
| special_conditions_at_site | 0 | 1.00 | FALSE | 9 | Non: 118495, Dat: 1524, Roa: 1372, Aut: 284 |
| carriageway_hazards | 0 | 1.00 | FALSE | 7 | Non: 119170, Dat: 1325, Oth: 1072, Any: 376 |
| urban_or_rural_area | 1 | 1.00 | FALSE | 3 | Urb: 82583, Rur: 39996, Una: 55 |
| lsoa_of_accident_location | 6445 | 0.95 | FALSE | 27965 | E01: 165, E01: 123, E01: 84, E01: 82 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| longitude | 55 | 1 | -1.26 | 1.40 | -7.27 | -2.19 | -1.15 | -0.14 | 1.76 | ▁▁▅▇▃ |
| latitude | 55 | 1 | 52.43 | 1.38 | 49.91 | 51.47 | 51.89 | 53.39 | 60.76 | ▇▆▁▁▁ |
| number_of_vehicles | 0 | 1 | 1.85 | 0.72 | 1.00 | 1.00 | 2.00 | 2.00 | 24.00 | ▇▁▁▁▁ |
| number_of_casualties | 0 | 1 | 1.31 | 0.76 | 1.00 | 1.00 | 1.00 | 1.00 | 59.00 | ▇▁▁▁▁ |
| first_road_number | 0 | 1 | 836.74 | 1670.33 | 0.00 | 0.00 | 41.00 | 580.00 | 9621.00 | ▇▁▁▁▁ |
| speed_limit | 0 | 1 | 37.11 | 14.07 | 20.00 | 30.00 | 30.00 | 40.00 | 70.00 | ▇▁▁▂▁ |
| second_road_number | 0 | 1 | 291.80 | 1129.17 | -1.00 | 0.00 | 0.00 | 0.00 | 9620.00 | ▇▁▁▁▁ |
| did_police_officer_attend_scene_of_accident | 0 | 1 | 1.29 | 0.47 | -1.00 | 1.00 | 1.00 | 2.00 | 3.00 | ▁▁▇▃▁ |
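The tables above show that, overall, there is little missing data across the 31 variables. The main exceptions are second_road_class (complete rate 0.57) and, to a lesser extent, lsoa_of_accident_location (0.95). They also give some basic statistics for the (few) numerical variables. Now let’s see the mode for every variable.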
Number of values and modes for every variable
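A table of this kind can be built in a few lines. The following is a minimal sketch in base R, not necessarily the code behind this report: `mode_of()` is a hypothetical helper, and the toy data frame (with invented values) stands in for accidents2018.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
mode_of <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Toy stand-in for accidents2018 (values invented for illustration).
toy <- data.frame(
  day_of_week = c("Friday", "Friday", "Monday", "Friday"),
  road_type   = c("Single carriageway", "Dual carriageway",
                  "Single carriageway", "Single carriageway")
)

# One row per variable: number of distinct values and the mode.
modes <- data.frame(
  variable  = names(toy),
  n_values  = vapply(toy, function(col) length(unique(col)), integer(1)),
  mode      = vapply(toy, mode_of, character(1)),
  row.names = NULL
)
print(modes)
```

In case of ties, `which.max()` keeps the first level in alphabetical order, which is worth remembering when two values are equally frequent.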
Possible research question here: Are professional drivers more prone to suffer an accident? The tables above raise interesting (basic) research questions to be explored. As an example, seeing that the day of the week on which most accidents take place is Friday, I would like to know whether most accidents happen on weekdays or at the weekend.
Research question here: Are weather and visibility conditions an important factor in accidents?
Surprisingly, most accidents take place in dry conditions, on fine days and with good visibility, so weather apparently does not have as big an impact as I would have guessed at first sight. Verifying this would require further analysis, since raw counts do not account for how much driving takes place in each condition.
There were a total of 122,635 accidents in 2018, of which 1% were fatal, 19% were serious, and 80% were slight. Let’s see how these figures have evolved over time and whether the number of accidents has increased or decreased.
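These shares can be checked directly from the severity counts reported in the summary table. A minimal sketch in base R, with the counts typed in by hand (the variable names are mine):

```r
# Severity counts for 2018, copied from the accident_severity row above.
severity_counts <- c(Fatal = 1671, Serious = 23165, Slight = 97799)

# Total number of accidents and the share of each severity level (%).
total_accidents <- sum(severity_counts)
severity_share  <- round(100 * prop.table(severity_counts), 1)

print(total_accidents)
print(severity_share)
```

With the full data frame at hand, `prop.table(table(accidents2018$accident_severity))` should give the same proportions without retyping the counts.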
Same histogram, from 2004 to 2018. The years 2004 and 2009 probably introduced changes in how data were gathered or in what was considered an accident.
Histogram of accidents by type and year, from 2009 to 2018.
While the number of accidents in the UK is high, we can see an overall downward trend over time. Can we observe other patterns?
| Year | Fatal | Fatal variation | Serious | Serious variation | Slight | Slight variation | Total | Total variation |
|---|---|---|---|---|---|---|---|---|
| 2009 | 2057 | NA | 21997 | NA | 139500 | NA | 163554 | NA |
| 2010 | 1731 | -18.83% | 20440 | -7.62% | 132243 | -5.49% | 154414 | -5.92% |
| 2011 | 1797 | 3.67% | 20986 | 2.60% | 128691 | -2.76% | 151474 | -1.94% |
| 2012 | 1637 | -9.77% | 20901 | -0.41% | 123033 | -4.60% | 145571 | -4.06% |
| 2013 | 1608 | -1.80% | 19624 | -6.51% | 117428 | -4.77% | 138660 | -4.98% |
| 2014 | 1658 | 3.02% | 20676 | 5.09% | 123988 | 5.29% | 146322 | 5.24% |
| 2015 | 1616 | -2.60% | 20038 | -3.18% | 118402 | -4.72% | 140056 | -4.47% |
| 2016 | 1695 | 4.66% | 21725 | 7.77% | 113201 | -4.59% | 136621 | -2.51% |
| 2017 | 1676 | -1.13% | 22534 | 3.59% | 105772 | -7.02% | 129982 | -5.11% |
| 2018 | 1671 | -0.30% | 23165 | 2.72% | 97799 | -8.15% | 122635 | -5.99% |
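The variation columns can be reproduced with a lagged difference. In the sketch below the totals are copied from the table above and the column names are mine. One thing to be aware of: the percentages in the table appear to divide each change by the current year's count (for 2010, (154414 - 163554) / 154414 is about -5.92%), whereas the more common year-on-year convention divides by the previous year's count (about -5.59%). Both are computed here so they can be compared.

```r
# Yearly totals copied from the table above.
yearly <- data.frame(
  year  = 2009:2018,
  total = c(163554, 154414, 151474, 145571, 138660,
            146322, 140056, 136621, 129982, 122635)
)

# Previous year's total, with NA for the first year.
previous <- c(NA, head(yearly$total, -1))
change   <- yearly$total - previous

# Change relative to the current year (matches the table above).
yearly$change_vs_current <- round(100 * change / yearly$total, 2)

# Change relative to the previous year (conventional definition).
yearly$change_vs_previous <- round(100 * change / previous, 2)

print(yearly)
```

The two definitions agree in sign and differ only slightly in size here, so the overall reading of the trend does not change.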
As can be seen in the table above, the total number of accidents has been decreasing overall, and 2018 is the year with the fewest accidents since 2009. This might seem good news (with plenty of room for improvement, given that the figures are still high), but we can also observe a slight increase in serious accidents, with 2018 being the year with the most serious accidents since 2009, while slight accidents have decreased. In addition, although fatal accidents have tended to decrease since 2009, their number has been more or less stable during the last 3 years.
Since the casualties data frame has no information about the casualties’ jobs, we need a proxy to answer the research question that arose earlier, after seeing that most accidents take place on Fridays. A tentative answer could be obtained by combining the time and the day of the week.
| Day of the Week | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sunday | 490 | 369 | 321 | 270 | 203 | 176 | 205 | 237 | 296 | 498 | 691 | 802 | 1022 | 1033 | 975 | 1013 | 926 | 923 | 864 | 728 | 574 | 463 | 397 | 322 |
| Monday | 218 | 136 | 125 | 96 | 104 | 174 | 389 | 1000 | 1446 | 973 | 778 | 845 | 984 | 966 | 1082 | 1460 | 1512 | 1630 | 1288 | 845 | 562 | 454 | 397 | 274 |
| Tuesday | 165 | 104 | 67 | 68 | 68 | 146 | 472 | 1015 | 1590 | 1029 | 774 | 848 | 899 | 941 | 996 | 1384 | 1538 | 1740 | 1389 | 942 | 656 | 456 | 403 | 258 |
| Wednesday | 163 | 100 | 95 | 64 | 85 | 171 | 375 | 1103 | 1646 | 953 | 766 | 850 | 985 | 888 | 1063 | 1542 | 1562 | 1705 | 1400 | 942 | 646 | 514 | 456 | 322 |
| Thursday | 176 | 111 | 88 | 77 | 83 | 177 | 422 | 1052 | 1632 | 934 | 851 | 828 | 944 | 1007 | 1047 | 1474 | 1558 | 1732 | 1475 | 957 | 674 | 533 | 476 | 348 |
| Friday | 245 | 148 | 114 | 80 | 83 | 179 | 408 | 941 | 1471 | 894 | 768 | 937 | 1096 | 1169 | 1261 | 1606 | 1714 | 1732 | 1468 | 1117 | 784 | 654 | 590 | 558 |
| Saturday | 380 | 299 | 243 | 180 | 172 | 175 | 210 | 290 | 457 | 594 | 861 | 1000 | 1186 | 1134 | 1130 | 1021 | 1095 | 1153 | 1007 | 948 | 801 | 602 | 547 | 584 |
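A day-by-hour table like the one above can be obtained by cross-tabulating the weekday and the hour of each accident. This is a minimal sketch in base R under some assumptions: the toy timestamps are invented, and the column name `datetime` and its date-time type are assumed for the example rather than taken from the real accidents2018 object.

```r
# Toy stand-in for accidents2018: four invented timestamps in 2018.
toy <- data.frame(
  datetime = as.POSIXct(c("2018-01-05 17:30:00", "2018-01-05 17:45:00",
                          "2018-01-06 08:10:00", "2018-01-08 08:05:00"),
                        tz = "UTC")
)

# "%w" gives the weekday as a number (0 = Sunday), which does not depend
# on the locale, unlike weekdays().
day_labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
day_index  <- as.integer(format(toy$datetime, "%w")) + 1
toy$day    <- factor(day_labels[day_index], levels = day_labels)

# Hour of the day as a factor, so that empty hours still get a column.
toy$hour <- factor(as.integer(format(toy$datetime, "%H")), levels = 0:23)

# Cross-tabulate: one row per day, one column per hour.
by_day_hour <- table(toy$day, toy$hour)
print(by_day_hour)
```

Declaring all 7 days and 24 hours as factor levels keeps the table rectangular even when some combinations have no accidents.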
As can easily be seen in the table above, most accidents take place during weekday peak hours, and their number tends to increase as Friday evening approaches, which is probably the busiest time and when people are most tired. The fact that most commuting takes place during these rush hours makes me think that we might reject the hypothesis that professional drivers are more prone to accidents, although more research would be required.
Let’s see how accidents are spatially distributed, to check whether we can identify hot areas. The following interactive map displays accidents by type, showing slight accidents, as they are the most significant ones in number.
Accidents by location and type.
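As a non-interactive complement to the map, hot areas can also be approximated by counting accidents per grid cell. The sketch below is only an illustration in base R: the coordinates are invented, and the 0.1 degree cell size and the column names `cell_x`, `cell_y` and `n` are arbitrary choices of mine.

```r
# Toy coordinates (invented): three accidents close together, one far away.
toy <- data.frame(
  longitude = c(-0.12, -0.13, -0.14, -1.95),
  latitude  = c(51.51, 51.52, 51.53, 52.48)
)

cell_size <- 0.1  # grid resolution in degrees (an arbitrary choice)

# Integer grid indices avoid floating-point issues when grouping.
toy$cell_x <- floor(toy$longitude / cell_size)
toy$cell_y <- floor(toy$latitude / cell_size)

# Count accidents per cell and put the busiest cells first.
toy$n <- 1
cell_counts <- aggregate(n ~ cell_x + cell_y, data = toy, FUN = sum)
cell_counts <- cell_counts[order(-cell_counts$n), ]
print(cell_counts)
```

With the formula interface, `aggregate()` drops rows with missing values by default, so the accidents without coordinates would simply be left out of the counts.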
This map raises many questions, such as: What is the impact of commuting on accidents? Are roads in less populated areas poorly maintained? I expected more populated areas to be more prone to accidents, but that is not always the case. In fact, I expected London to be the place where most accidents happened, but it is not (although the number of accidents around the city is considerable, which makes me think that most of those accidents are due to commuting). Seeing some less dense areas with a significant number of accidents makes me wonder whether that is related to the quality of the roads or whether there are big industrial areas that attract more traffic. (Note: answering those questions would require extra datasets. For example, it would be interesting to cross the accidents’ locations with the investment in and maintenance of the roads and/or their physical features, e.g. from OpenStreetMap.)
On the other hand, since we have the coordinates of every accident, we could also analyse them at a closer scale. As suggested in the Active Travel Podcast Pilot: Media reporting of Active Travel, it could be interesting to view a picture of the places where accidents took place in order to identify possible correlations between their physical features and the number of accidents and casualties. As a prototype, the following code gets a picture from Mapillary for each of the five accidents with the most casualties, which could be the foundation of larger research based on machine learning. (Note: Mapillary is a service that provides crowdsourced street-level imagery, in a Google Street View fashion. Google services could also have been used, but they require a fee to obtain an API key, which was out of the scope of this example.)
# Dataframe preparation: keep the five accidents with the most casualties.
library(dplyr)  # provides %>%, select(), arrange(), mutate(), relocate()

accidents_by_casualties <- accidents2018 %>%
  select(longitude, latitude, number_of_casualties) %>%
  arrange(desc(number_of_casualties)) %>%
  mutate(id = row_number()) %>%
  relocate(id) %>%
  head(5)

# Download images from Mapillary.
# images() and get_img() are assumed to come from the Mapillary client
# package loaded elsewhere in this report (not shown in this excerpt).
for (i in accidents_by_casualties$id) {
  print(paste0("Displaying mapillary image close to lon=",
               accidents_by_casualties$longitude[i], " and lat=",
               accidents_by_casualties$latitude[i]))
  # Ask for a single image near the accident location.
  img <- images(closeto = c(accidents_by_casualties$longitude[i],
                            accidents_by_casualties$latitude[i]),
                radius = 1000, page = 1, per_page = 1,
                print = FALSE)$img_key
  get_img(img_key = img, size = "l")
}
## [1] "Displaying mapillary image close to lon=-0.818005 and lat=52.43432"
## [1] "Displaying mapillary image close to lon=-4.328339 and lat=55.873593"
## [1] "Displaying mapillary image close to lon=-0.561374 and lat=51.914048"
## [1] "Displaying mapillary image close to lon=0.003746 and lat=52.614004"
## [1] "Displaying mapillary image close to lon=-1.197392 and lat=51.250871"
STATS19 provides a second dataset describing the casualties involved in each accident of the accidents dataset we have just explored.
Let’s see how many observations we have, as well as the number and types of the variables.
Data summary
| Property | Value |
|---|---|
| Name | casualties2018 |
| Number of rows | 160597 |
| Number of columns | 16 |
| Column type frequency: factor | 13 |
| Column type frequency: numeric | 3 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| accident_index | 0 | 1 | FALSE | 122635 | 201: 59, 201: 29, 201: 23, 201: 20 |
| casualty_class | 0 | 1 | FALSE | 3 | Dri: 103371, Pas: 34794, Ped: 22432 |
| sex_of_casualty | 0 | 1 | FALSE | 3 | Mal: 95252, Fem: 65305, Dat: 40 |
| age_band_of_casualty | 0 | 1 | FALSE | 12 | 26 : 33242, 36 : 24225, 46 : 22454, 21 : 18187 |
| casualty_severity | 0 | 1 | FALSE | 3 | Sli: 133302, Ser: 25511, Fat: 1784 |
| pedestrian_location | 0 | 1 | FALSE | 12 | Not: 138163, In : 9153, Cro: 3603, On : 2308 |
| pedestrian_movement | 0 | 1 | FALSE | 10 | Not: 138163, Cro: 7274, Unk: 6205, Cro: 4648 |
| car_passenger | 0 | 1 | FALSE | 4 | Not: 131009, Fro: 18048, Rea: 11057, Dat: 483 |
| bus_or_coach_passenger | 0 | 1 | FALSE | 6 | Not: 157064, Sea: 2218, Sta: 956, Ali: 160 |
| pedestrian_road_maintenance_worker | 0 | 1 | FALSE | 4 | No : 153620, Not: 6852, Yes: 87, Dat: 38 |
| casualty_type | 7 | 1 | FALSE | 21 | Car: 90913, Ped: 22432, Cyc: 17550, Mot: 7221 |
| casualty_home_area_type | 0 | 1 | FALSE | 4 | Urb: 115934, Dat: 16594, Rur: 15534, Sma: 12535 |
| casualty_imd_decile | 0 | 1 | FALSE | 11 | Dat: 27345, Mor: 16684, Mos: 16007, Mor: 15893 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_reference | 0 | 1 | 1.48 | 2.56 | 1 | 1 | 1 | 2 | 999 | ▇▁▁▁▁ |
| casualty_reference | 0 | 1 | 1.40 | 2.70 | 1 | 1 | 1 | 1 | 991 | ▇▁▁▁▁ |
| age_of_casualty | 0 | 1 | 37.06 | 19.66 | -1 | 22 | 34 | 50 | 102 | ▃▇▅▂▁ |
The tables above show that, overall, there is no significant missing data in any of the 16 variables, and they give some basic statistics for the (few) numerical variables. Now let’s see the mode for every variable.
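One caveat on missing data: age_of_casualty has a minimum of -1, which looks like a code for an unknown age rather than a real value (an assumption worth checking against the STATS19 documentation). Such codes are not counted in n_missing. A minimal sketch in base R of how they could be turned into NA before computing statistics, using invented ages:

```r
# Invented ages, with -1 assumed to be a code for "unknown".
ages <- c(34, -1, 22, 50, -1, 102)

# Replace negative codes with NA so they are treated as missing.
ages_clean <- replace(ages, ages < 0, NA)

n_missing <- sum(is.na(ages_clean))
mean_age  <- mean(ages_clean, na.rm = TRUE)

print(n_missing)
print(mean_age)
```

Leaving the -1 values in place would pull the mean and the lower quantiles slightly down, so recoding them first gives a more faithful summary.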
From the tables above, we can profile the typical casualty in 2018 as a male aged 26 to 35, driving a car involved in an accident in an urban area, who was slightly injured. Let’s further explore the casualties’ demographics.
Histogram of casualties’ distribution by age and sex.
At this level of detail, we cannot see notable differences between genders. Both males and females seem to follow the same age distribution, although admittedly the absolute numbers for females are notably smaller at all ages.
Let’s see whether both genders follow the same distribution by accident severity.
Histogram of casualties’ distribution by age and sex, grouped by accident severity.
As can be seen in the plots above, the number of young females involved in fatal and serious accidents is much lower than that of their male counterparts.
This is the end (for now) of this mock report, which aimed to get to know the STATS19 dataset and to try some new coding. There is still a lot of data to explore, which in turn will lead to new research questions, especially if we combine the different datasets (thankfully they share an accident_index that makes this possible).
This document has left many questions unanswered, along with others that have not been directly mentioned, such as whether women involved in accidents are usually drivers or not.
Another thing I would love to do is to join vehicles and accidents to see whether accident severity follows a similar distribution across the types of vehicles involved. My hypothesis is that fatal accidents involving cars will be far more numerous than those involving bicycles, which I expect to be quite marginal.
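Such a join could look like the sketch below, written in base R with two invented toy tables. Only accident_index and accident_severity are taken from the datasets described in this report; the column name `vehicle_type` and all the values are assumptions for illustration.

```r
# Toy stand-ins for the accidents and vehicles tables (values invented).
accidents_toy <- data.frame(
  accident_index    = c("A1", "A2", "A3"),
  accident_severity = c("Slight", "Fatal", "Serious")
)
vehicles_toy <- data.frame(
  accident_index = c("A1", "A1", "A2", "A3"),
  vehicle_type   = c("Car", "Pedal cycle", "Car", "Car")  # hypothetical column
)

# Join on the shared key: one row per vehicle, with its accident's severity.
joined <- merge(vehicles_toy, accidents_toy, by = "accident_index")

# Cross-tabulate vehicle type against accident severity.
severity_by_vehicle <- table(joined$vehicle_type, joined$accident_severity)
print(severity_by_vehicle)
```

Because an accident can involve several vehicles, the joined table counts vehicles rather than accidents, which matters when comparing the result with the accident totals above.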
Also, I would love to study the impact of the physical conditions of the highways and their environment. Although the accidents dataset has some information about this, I don’t think it is enough, so, as an OpenStreetMap contributor and advocate, I would love to combine both datasets.